Abstract

We present a Coding Theorem Method involving various notions from computer
science, particularly the concepts of algorithmic probability and universal
search, that together provide a solution to the problem of approximating
the Kolmogorov complexity of a string, as an alternative to lossless
compression algorithms. This method and the traditional compression method
complement each other, being serviceable for different string lengths. We
provide a thorough analysis and exact complexity values for all
$\sum_{n=1}^{11} 2^n$ binary strings of length n < 12 and for most strings of
length 12 ≤ n ≤ 16 by running all the ~ $2.5 \times 10^{13}$ Turing machines
with 5 states ($8 \times 22^9$ with reduction techniques) to calculate an
output frequency distribution. We also address the
question of the stability of our approach, that is the sensitivity of K to the 
continuation of the application of the method for larger coverage and better 
accuracy, with numerical evidence suggesting robustness. Just as for com- 
pression algorithms, this work promises to deliver a full range of applications, 
and to provide insight into the question of randomness for finite strings. 

Keywords: Kolmogorov complexity; Solomonoff induction; algorithmic
probability; Levin's Universal Distribution; Coding theorem; Invariance 
1. Introduction



The evaluation of the complexity of strings is key in many areas of science. 
For example, organization, structure, simplicity and randomness are all no- 
tions used to describe biological systems and functions, and are tools needed 
in bioinformatics for DNA sequence analysis, genotype and phenotype maps, 
and the study of regulatory networks. 

In all the domains where formal mathematical definitions of complexity 
are needed, the field of algorithmic information theory (AIT) has been largely 
ignored as a potential source of applications. However, when researchers have 
chosen to apply the theory, which in principle was not supposed to be of any 
practical use [6], it has proven to be of great value, for example for genetic
sequence analysis for DNA false positive repeat sequence detection [TS],
distance measures and classification methods [7], and many other applications
[17]. But this effort has also been limited, both in terms of the number of
people involved (a handful of senior researchers) and owing to the limitations
of compressibility, currently the only method used to approximate the Kol- 
mogorov complexity of strings. The method presented in this paper aims to 
solve the problem of the evaluation of the complexity of short strings. 

2. Kolmogorov complexity 

Central to AIT is the definition of algorithmic (Kolmogorov-Chaitin or 
program-size) complexity [151 H] : 

$K_T(s) = \min\{|p| : T(p) = s\}$ (1)

That is, the length of the shortest program p that outputs the string s 
running on a universal Turing machine T. A technical inconvenience of K as 
a function taking s to the length of the shortest program that produces s is its 
uncomputability, proven by reduction to the halting problem. In other words, 
there is no program which takes a string s as input and produces the integer 
K(s) as output. This is usually considered a major problem, but one ought 
to expect a universal measure of complexity to have such a property. The 
measure was first conceived to define randomness and is today the accepted 
objective mathematical measure of complexity, among other reasons because 
it has been proven to be mathematically robust (by virtue of the fact that 
several independent definitions converge to it). The mathematical theory
of randomness has demonstrated that properties of random objects can be
captured by uncomputable measures. If the shortest program producing
s is larger than |s|, the length of s, then s is considered random. One
can approach K using compression algorithms that detect regularities in 
order to compress data. The value of the compressibility method is that 
the compression of a string as an approximation to K is a sufficient test of 
non-randomness (or simplicity). 

It was once believed that AIT would prove useless for any real-world
applications despite the beauty of its mathematical results (e.g. a derivation of
Gödel's incompleteness theorem [5]). This was thought to be due to
uncomputability and to the fact that the theory's founding theorem (the invariance
theorem) left finite (short) strings unprotected against an additive constant
determined by the arbitrary choice of programming language or universal
Turing machine (upon which one evaluates the complexity of a string), and
hence made their complexity values unstable and extremely sensitive to this
choice. One route to approximating K is therefore through lossless
compression algorithms, which have
turned out to be capable of providing many applications (see examples in 
PH and [22]). 

Traditionally, the way to approach the algorithmic complexity of a string
has been by using lossless compression algorithms. The result of a lossless
compression algorithm is an upper bound on algorithmic complexity. But one
can never tell when a string is not compressible (one can design ad hoc
compression algorithms, but it is known that computing the optimal compression
algorithm is an NP-complete problem; Theorem 2.3.5 in [20]). If, however,
one succeeds at somehow shortening a string, one can show that its
algorithmic complexity cannot be larger than the compressed length. Hence
compressing is a sufficient test of non-randomness. But the converse is not true.
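
The asymmetry between the two directions can be illustrated with a minimal Python sketch. It assumes the standard-library zlib module (a DEFLATE implementation, chosen only for convenience) and measures lengths in bytes rather than bits; the helper name `compressed_length` is ours and not part of the method.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib (DEFLATE) compressed form of data."""
    return len(zlib.compress(data, 9))

# A highly regular string: 10 000 copies of the character '1'.
regular = b"1" * 10_000

# A pseudo-random string of the same length (fixed seed for repeatability).
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10_000))

regular_len = compressed_length(regular)
noisy_len = compressed_length(noisy)

# Successful compression certifies non-randomness (an upper bound on K).
# Failure to compress certifies nothing: a cleverer compressor might succeed.
print(regular_len, noisy_len)
```

The regular string shrinks to a small fraction of its length, which bounds its complexity from above, while the second result only says that this particular compressor found no regularity.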

Short strings are not only difficult to compress in practice; the theory also does
not provide a satisfactory answer to all questions concerning them, such as
the Kolmogorov complexity of a single bit (which the theory would say has 
maximal complexity because it cannot be further compressed). To make 
sense of such things and close this theoretical gap we devised an alterna- 
tive methodology [12] to compressibility for approximating the complexity 
of short strings, hence a methodology applicable in many areas where short 
strings are often investigated (e.g. in bioinformatics). This method has yet 
to be extended and fully deployed in real applications, and here we take a 
major step towards full implementation, providing details of the method as
well as a thorough theoretical analysis.



3. The limits of compression algorithms 

A fair compression algorithm is one that transforms a string into two 
components. The first of these is the compressed version while the other 
is the set of instructions for decompressing the string. Both together ac- 
count for the final length of the compressed version. Thus the compressed 
string comes with its own decompression instructions. Paradoxically, lossless 
compression algorithms are more stable the longer the string. In fact the 
invariance theorem guarantees that complexity values will only diverge by a 
constant c (e.g. the length of a compiler, a translation program between $U_1$
and $U_2$) and will converge at the limit.

Invariance Theorem ([2, 17]): If $U_1$ and $U_2$ are two universal Turing
machines and $K_{U_1}(s)$ and $K_{U_2}(s)$ the algorithmic complexity of s for
$U_1$ and $U_2$, there exists a constant c such that:

$|K_{U_1}(s) - K_{U_2}(s)| < c$ (2)

Hence the longer the string, the less important c is (i.e. the choice of
programming language or universal Turing machine). However, in practice
c can be arbitrarily large, thus having a very great impact on finite short
strings.

The use of lossless data compression algorithms as a method for
approximating the Kolmogorov complexity of an object (e.g. a string) turns out to
be accurate in direct proportion to the length of the string. In effect, the
shorter the string, the greater the margin of error. Too great a margin of
error means that one cannot really compare evaluations of the complexity of
short strings. Compression algorithms remain limited in their scope of
application. So if one wished to tell which of two strings is more or less randomly
complex by approximating their algorithmic complexity using a compression
algorithm, it turns out that there is no way to do so if the strings are relatively
short (as we can see from the example in Fig. 1, any string of length n < 42
bits is compressed at about the same length). The resulting compression
lengths are greater than the length of the original strings, which, if taken as
Kolmogorov complexity values, would mean that they are all random (even
when the strings, such as those in Fig. 1, are only simple repetitions of 1s).

Figure 1: The problem of short strings: On the x-axis are groups formed by strings of the
form $1^n$ (that is, a 1 followed by n 1s) that after compression have the same compressed
length. The results are unsatisfactory for lengths n < 53 (all grouped in x = 1 for which
y = 42). Strings of length n < 42 are compressed into strings longer than their original
length. It is also not uncommon to detect instabilities, such as for group x = 5, which
for no apparent reason the compression algorithm was able to compress better before
returning to the average compression trend. This is not a malfunction of the particular
compression algorithm (DEFLATE, used in most popular computer formats such as ZIP
and PNG) or its implementation, but a common issue for lossless compression algorithms
trying to compress short strings in general.

3.1. The problem of short strings 

The chief advantage of compression algorithms is that they are a sufficient
test of non-randomness. However, for short strings, which are usually the
ones useful for practical applications, adding the decompression instructions 
to the compressed version makes the compressed string often, if not always, 
longer than the string itself, simply because the decompression instructions 
are at least equal in length to the original string (see Fig. 1). If the string 
is shorter than the size of the decompression algorithm, say, there will not 
be a way to compress the string into something shorter still. The result will 
be so dependent on the size of the decompression algorithm that the final 
value of the compressed length will be too unstable under different lossless 
compression/decompression algorithms.
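
The overhead described above is easy to observe. The following Python sketch is only an illustration: it assumes the standard-library zlib module, counts bytes rather than bits, and does not try to reproduce the exact values of Fig. 1. The function name `deflate_length` is ours.

```python
import zlib

def deflate_length(n: int) -> int:
    """Bytes needed by zlib to store the string 1^n (n copies of '1')."""
    return len(zlib.compress(b"1" * n, 9))

# Compressed length (in bytes) for very short inputs.
table = {n: deflate_length(n) for n in range(1, 9)}
for n, size in table.items():
    print(f"n={n}: original {n} bytes, compressed {size} bytes")
```

For these lengths the container overhead alone (header and checksum) already exceeds the input, so every "compressed" version is longer than the string it encodes.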

Given the definition of algorithmic complexity based on compressibility,
i.e., that the less compressible the more randomly complex a string is, it
follows immediately that a single bit, 0 or 1, is surely random because it has
maximal algorithmic complexity, since there is no way to further compress
a single bit. It is hard to explain how 1 could seem more random than,
say, any other possible string. If 1 is maximally random, how is it relatively
more complex than 10 or 1011 (or indeed anything other than 1)? So if a
single bit is the most random among all finite strings, how can there be a
phase transition from maximal randomness to extreme simplicity (very low
Kolmogorov complexity) for, say, strings of length 4, 6 or 10 bits? What if
one asks how common 0 or 1 are as the output of a computer program? We
will see that the method proposed herein addresses this issue in an alternative
manner, providing some answers to these questions.

4. Solomonoff-Levin Algorithmic Probability 

The algorithmic probability (also known as Levin's semi-measure) of a
string s is a measure that describes the expected probability of a random
program p running on a universal (prefix-free¹) Turing machine T producing
s. Formally [2U [16l S],

$m(s) = \sum_{p : T(p) = s} 1/2^{|p|}$ (3)

i.e. the sum over all the programs for which T with p outputs s and halts.

Levin's semi-measure² m(s) defines a distribution known as the Universal
Distribution [14]. It is important to notice that the value of m(s) is
dominated by the length of the smallest program p (the one for which the
denominator $2^{|p|}$ is smallest). The length of the smallest p that produces
the string s is, however, K(s). The semi-measure m(s) is therefore also
uncomputable, because for every s, m(s) requires the calculation of
$2^{-K(s)}$, involving K, which is itself uncomputable. An alternative [12] to
the traditional use of compression algorithms is the use of the concept of
algorithmic probability to calculate K(s) (see Fig. 2) by means of the
following theorem.

1 The group of valid programs forms a prefix-free set (no element is a prefix of any
other, a property necessary to keep 0 < m(s) < 1). For details see [5].

2 It is called a semi-measure because the sum is never 1, unlike probability measures.
This is due to the Turing machines that never halt.

Coding Theorem (Levin [16]):

$|-\log_2 m(s) - K(s)| < c$ (4)

An informal interpretation is that if a string has many long descriptions
it also has a short one. It beautifully connects frequency to complexity, more
specifically the frequency (or probability) of occurrence of a string with its
algorithmic (Kolmogorov) complexity. The coding theorem implies that j9j E]
one can calculate the Kolmogorov complexity of a string from its frequency
[TTj [TUt |25| [12], simply rewriting the formula as:

$K(s) = -\log_2 m(s) + O(1)$ (5)

An important property of m as a semi-measure is that it dominates any
other effective semi-measure $\mu$ because there is a constant $c_\mu$ such that for all
s, $m(s) \geq c_\mu \mu(s)$. For this reason m(s) is often called a universal distribution.
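
The step from the coding theorem to Eq. (5) can be spelled out as follows. This is a short worked derivation, assuming the coding theorem in its usual form, with a constant c that does not depend on s:

```latex
\begin{align*}
\left| -\log_2 m(s) - K(s) \right| < c
  &\iff -\log_2 m(s) - c < K(s) < -\log_2 m(s) + c \\
  &\implies K(s) = -\log_2 m(s) + O(1).
\end{align*}
```

For instance, a string produced with frequency 1/8 would be assigned about $-\log_2(1/8) = 3$ bits, up to the additive constant.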

4.1. The Busy Beaver

Notation: We denote by (n, 2) the class (or space) of all n-state 2-symbol 
Turing machines (with the halting state not included among the n states). 

In addressing the problem of approaching m(s) by running computer programs
(in this case deterministic Turing machines) one can use the known
values of the so-called Busy Beaver functions as suggested by and used in
[25, 12]. The Busy Beaver functions $\Sigma(n,m)$ and $S(n,m)$ can be defined as
follows:

Busy Beaver functions (Rado [IS]): If $\sigma_T$ is the number of 1s on the tape
of a Turing machine T with n states and m symbols upon halting starting
from a blank tape (no input), then the Busy Beaver function
$\Sigma(n,m) = \max\{\sigma_T : T \in (n,m),\ T \text{ halts}\}$. Alternatively, if $t_T$ is the number of steps
that a machine T takes before halting from a blank tape, then
$S(n,m) = \max\{t_T : T \in (n,m),\ T \text{ halts}\}$.

In other words, the Busy Beaver functions are the functions that return
the longest written tape and longest runtime in a set of Turing machines with
n states and m symbols. $\Sigma(n,m)$ and $S(n,m)$ are noncomputable functions
by reduction to the halting problem. In fact $\Sigma(n,m)$ grows faster than any
computable function can. Nevertheless, exact values can be calculated for
small n and m, and they are known for, among others, m = 2 symbols and
n < 5 states. A program showing the evolution of all known Busy Beaver
machines developed by one of this paper's authors is available online [26].

This allows one to circumvent the problem of noncomputability for small 
Turing machines of the size that produce short strings whose complexity is 
approximated by applying the coding theorem. Fig. 2 shows a flow chart
of this approximation. As is widely known, the Halting problem for Turing 
machines is the problem of deciding whether an arbitrary Turing machine 
T eventually halts on an arbitrary input s. Halting computations can be 
recognized by running them for the time they take to halt. The problem is 
to detect non-halting programs, programs about which one cannot know in 
advance whether they will run forever or eventually halt. 

4.2. The Turing machine formalism

It is important to describe the Turing machine formalism because exact 
values of algorithmic probability for short strings will be provided under this 
chosen standard model of a Turing machine. 

Consider a Turing machine with the binary alphabet $\Sigma = \{0, 1\}$ and n
states {1, 2, ..., n} and an additional Halt state denoted by 0 (as defined by
Rado in his original Busy Beaver paper [T5]).

The machine runs on a 2-way unbounded tape. At each step: 

1. the machine's current "state" (instruction); and 

2. the tape symbol the machine's head is scanning 

define each of the following: 

1. a unique symbol to write (the machine can overwrite a 1 on a 0, a 0 on
a 1, a 1 on a 1, and a 0 on a 0);

2. a direction to move in: −1 (left), 1 (right) or 0 (none, when halting);
and

3. a state to transition into (which may be the same as the one it was in). 

The machine halts if and when it reaches the special halt state 0. There
are $(4n + 2)^{2n}$ Turing machines with n states and 2 symbols according to the
formalism described above. The output string is taken from the number of
contiguous cells on the tape the head of the halting n-state machine has gone
through. A machine produces a string upon halting.
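
The formalism can be made concrete with a small simulator. The Python sketch below is illustrative only: the dictionary encoding of the transition table, the function name `run`, and the reading of the output as the contents of the contiguous visited cells are assumptions of the sketch, not the authors' implementation.

```python
from typing import Dict, Optional, Tuple

# (state, read symbol) -> (symbol to write, head move, next state).
# States are 1..n, the halting state is 0, and a halting transition moves 0.
Machine = Dict[Tuple[int, int], Tuple[int, int, int]]

def run(machine: Machine, max_steps: int, blank: int = 0) -> Optional[Tuple[str, int]]:
    """Run from a blank tape in state 1.

    Returns (output, steps) if the machine halts within max_steps,
    otherwise None. The output is read off the contiguous cells visited.
    """
    tape: Dict[int, int] = {}
    pos, state = 0, 1
    lo = hi = 0  # leftmost and rightmost visited cells
    for step in range(1, max_steps + 1):
        write, move, state = machine[(state, tape.get(pos, blank))]
        tape[pos] = write
        if state == 0:  # halting state reached
            cells = (tape.get(i, blank) for i in range(lo, hi + 1))
            return "".join(map(str, cells)), step
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
    return None

# Rado's 2-state Busy Beaver, written in this formalism
# (the final transition writes a 1 and halts without moving).
bb2: Machine = {
    (1, 0): (1, 1, 2),
    (1, 1): (1, -1, 2),
    (2, 0): (1, -1, 1),
    (2, 1): (1, 0, 0),
}

result = run(bb2, max_steps=6)
print(result)
```

Running the same machine with a budget of fewer than 6 steps returns None, which is how a runtime bound turns "has not halted yet" into a usable cut-off.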
5. The Coding Theorem Method

One can attempt to approximate m(s) by running every Turing machine 
following a particular enumeration, for example, a quasi-lexicographical or- 
dering, from shorter to longer (with number of states n and 2 fixed symbols). 
It is clear that in this fashion once a machine produces s for the first time, one 
can directly calculate an exact value of K, because this is the length of the 
first Turing machine in the enumeration of programs of increasing size that 
produces s. Let's formalize this by using the function D(n,m) as the
function that assigns to every string s produced in (n,m) the quotient: (number
of times that a machine in (n,m) produces s) / (number of machines that
halt in (n,m)) as defined in [25, 12]. More formally,

$D(n,m)(s) = \frac{|\{T \in (n,m) : T(p) = s\}|}{|\{T \in (n,m) : T \text{ halts}\}|}$ (6)

where T(p) is the Turing machine with number p (and empty input) that
produces s upon halting and |A| is, in this case, the cardinality of the set A.
A variation of this formula closer to the definition of m is given by:

$D'(n,m)(s) = \frac{|\{T \in (n,m) : T(p) = s\}|}{|\{T \in (n,m)\}|}$ (7)

The values of D' sum to strictly less than 1 (because of the Turing machines
that never halt), just as those of m do, unlike D, whose values for fixed n
and m always sum to 1. We will use Eq. (6) for practical reasons, because it makes the
frequency values more readable (most machines don't halt, so those halting
would have a tiny fraction with too many leading zeros after the decimal
point if written in decimal).
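
The difference between the two normalizations can be seen on the smallest space. The Python sketch below is a minimal illustration; the function name `output_distributions` is ours. The counts for (1,2) with blank symbol 0 follow from a direct count under the formalism of Section 4.2: of the 36 machines, the 12 whose transition for (state 1, symbol 0) halts produce '0' or '1', and the other 24 never halt.

```python
from collections import Counter
from math import log2

def output_distributions(counts: Counter, total_machines: int):
    """Return (D, D_prime) as dicts mapping strings to frequencies.

    D divides by the number of halting machines, as in Eq. (6);
    D_prime divides by all machines in the space, as in Eq. (7).
    """
    halting = sum(counts.values())
    d = {s: c / halting for s, c in counts.items()}
    d_prime = {s: c / total_machines for s, c in counts.items()}
    return d, d_prime

# Space (1,2) with blank symbol 0: 6 machines output '0', 6 output '1'.
counts = Counter({"0": 6, "1": 6})
d, d_prime = output_distributions(counts, total_machines=36)

# Coding-theorem estimate of complexity from the frequency.
k_m = {s: -log2(p) for s, p in d.items()}
print(d, d_prime, k_m)
```

Here D sums to 1 while D' sums to 1/3, and the estimate obtained from D assigns one bit to each one-symbol string.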

It was proven in [25, 12] that the function $(n,m) \to D(n,m)$ is
non-computable by reduction to the halting problem. However, D(n,m) is lower
semi-computable, meaning it can be computably approximated from below,
for example, by running small Turing machines for which values of
the Busy Beaver problem [TO] are known. For example pQ, for n = 4, the
Busy Beaver function for maximum runtime S tells us that S(4, 2) = 107, so
we know that a machine running on a blank tape will never halt if it hasn't 
halted after 107 steps, and so we can stop it manually. In what follows we 
describe the exact methodology. From now on, D(n) with a single parameter 
will mean D(n,2). 
[Figure 2 flow chart. Recoverable labels: "coding theorem"; "K(s) = -log Pr(s) + c"; "TM produces s for the first time, thus K(s) = |TM_n| in bits".]

Figure 2: A flow chart illustrating the Coding Theorem Method, a never-ending algorithm
for evaluating the (Kolmogorov) complexity of a (short) string making use of several
concepts and results from theoretical computer science, in particular the halting probability,
the Busy Beaver problem, Levin's semi-measure and the Coding theorem. The Busy
Beaver values can be used up to 4 states for which they are known, for more than 4 states
an informed maximum runtime is used (see Section 6.1). Notice that Pr are the probability
values calculated dynamically by running an increasing number of Turing machines. Pr is
intended to be an approximation to m(s) out of which we build D(n) after application of
the Coding theorem.



We call this method the Coding Theorem Method to approximate K
(which we will denote by $K_m$).

6. Methodology for calculating D(5)

Previously [23 H2] we had calculated the full output distribution of Turing
machines with 2 symbols and n = 1, ..., 4 states, for which the Busy Beaver
values are known, in order to determine the halting time, that is a total of
36, 10 000, 7 529 536 and 11 019 960 576 Turing machines respectively. The
formula for the number of machines given a number of states n is given by
$(4n + 2)^{2n}$, derived from the formalism described in Section 4.2. There are
therefore 26 559 922 791 424 Turing machines with 5 states.³

Because there are a large enough number of machines to run even for a 
small number of machine states (n), applying the coding theorem provides 
a finer and increasingly stable evaluation of K(s) based on the frequency of 
production of a large number of Turing machines, but the number of Tur- 
ing machines grows exponentially, and producing D(5) requires considerable 
computational resources. 

6.1. Setting the runtime 

The Busy Beaver for Turing machines with 4 states is known to be 107
steps [1], that is, any Turing machine with 2 symbols and 4 states running
longer than 107 steps will never halt. However, the exact number is not
known for Turing machines with 2 symbols and 5 states, although it is
believed to be 47 176 870, as there is a candidate machine that runs for this
long and halts and no machine with a greater runtime has yet been found.

So we decided to let the machines with 5 states run for 4.6 times the
Busy Beaver value for 4-state Turing machines (107 steps), knowing that
this would constitute a sample significant enough to capture the behavior
of Turing machines with 5 states. The chosen runtime was rounded to 500
steps, which was used to build the output frequency distribution for D(5).
The theoretical justification for the pertinence and significance of the chosen
runtime is provided in Section 9.

7. Reduction techniques 

We didn't run all the Turing machines with 5 states to produce D(5) 
because one can take advantage of symmetries and anticipate some of the 
behavior of the Turing machines directly from their transition tables with- 
out actually running them (this is impossible in general due to the halting 
problem). We avoided some trivial machines whose results we know without 
having to run them (reduced enumeration). Also, some non-halting machines 
were detected before consuming all the runtime (filters). The following are
the reductions utilized in order to reduce the number of total machines and
therefore the computing time for the approximation of D(5).

3 That is, for the reader's amusement, about the same number of red cells in the blood
of an average adult.



7.1. Exploiting symmetries 

7.1.1. Symmetry of 0 and 1

No transition starting from the halting state exists, and the blank symbol
is one of the 2 symbols (0 or 1) in the first run, while the other is used in
the second run (in order to avoid any asymmetries due to the choice of a
single blank symbol). In other words, we ran each machine twice, once with
0 as the blank symbol (the symbol with which the tape starts out and fills
up), and once more with 1 as the blank symbol. Due to the symmetry of the
computation, there is no real need to run each machine twice; one can
complete the string frequencies by assuming that each string produced its
complement with the same frequency, and then group and divide by symmetric
groups. We used this technique from D(1) to D(4). A more detailed
explanation of how this is done is provided in [25, 12] using Polya's counting theorem.

7.1.2. Symmetry right-left 

We can exploit the right-left symmetry. We may, for example, run only
those machines with an initial transition (initial state and blank symbol)
moving to the right and to a state different from the initial one (because an
initial transition to the initial state produces a non-halting machine) and the
halting one (these machines stop in just one step and produce '0' or '1').

For every string produced, we also count the reverse in the tables. We
count the corresponding number of one-symbol strings and non-halting
machines as well (see Subsection 7.3 for details).



7.2. Reduction techniques by exploiting symmetries 

If we consider only machines with a starting transition that moves to
the right and goes to a state other than the starting and halting states, the
number of machines is given by

$\mathrm{total}(n) := 2(n-1)(4n+2)^{2n-1}$

Note that for the starting transition there are 2(n − 1) possibilities (2 possible
symbols to write and n − 1 possible new states, as we exclude the starting and
halting states). For the other 2n − 1 transitions there are 4n + 2 possibilities.
We can make an enumeration from 0 to total(n) − 1. Of course, this
enumeration is not the same as the one we use to explore the whole space.
The same number will not correspond to the same machine.

In the whole D(5) space there are $(4n + 2)^{2n}$ machines, so it's a
considerable reduction. This reduction in D(5) means that in the reduced
enumeration we have 4/11 of the machines we had in the original enumeration.
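
The sizes of the full and reduced spaces, and the 4/11 ratio, can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function names `full_space` and `reduced_space` are ours.

```python
from fractions import Fraction

def full_space(n: int) -> int:
    """Number of n-state, 2-symbol machines: (4n + 2)^(2n)."""
    return (4 * n + 2) ** (2 * n)

def reduced_space(n: int) -> int:
    """Machines whose first transition moves right to a new, non-halting state."""
    return 2 * (n - 1) * (4 * n + 2) ** (2 * n - 1)

for n in range(1, 6):
    print(n, full_space(n), reduced_space(n))

# Exact ratio of the reduced to the full enumeration for 5 states.
ratio = Fraction(reduced_space(5), full_space(5))
print(ratio)
```

In general the ratio is 2(n − 1)/(4n + 2), so the saving from this reduction shrinks slowly as the number of states grows.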

7.3. Strings to complete after running the reduced enumeration 

Suppose that using the previous enumeration we run M machines for
D(n) with blank symbol 0. M can be the total number of machines in the
reduced space or a random number of machines in it (such as we use to study
the runtime distribution, as described in Section 9).

For the starting transition we considered only 2(n − 1) possibilities out of
4n + 2 possible transitions in the whole space. Then, we proceeded as follows
to complete the strings produced by the M runs.

1. We avoided 2(n − 1) transitions moving left to a different state than the
halting and starting ones. We completed such transitions by reversing
all the strings found. Non-halting machines were multiplied by 2.

2. We also avoided 2 transitions (writing '0' or '1') from the initial to the
halting state. We completed such transitions by

• Including $\frac{M}{2(n-1)}$ times '0'.

• Including $\frac{M}{2(n-1)}$ times '1'.

3. Finally, we avoided 4 transitions from the initial state to itself (2
movements × 2 symbols). We completed by including $\frac{2M}{n-1}$ non-halting
machines.

With these completions, we obtained the output strings for the blank
symbol 0. To complete for the blank symbol 1 we took the complement to 1
of each string produced and counted the non-halting machines twice.

Then, by running M machines, we obtained a result representing $M\frac{4n+2}{n-1}$
machines, which for n = 5 is 5.5M.
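
The completion steps can be sketched as a bookkeeping routine. The Python below is a sketch under stated assumptions, not the authors' code: each avoided starting transition is taken to stand for M/(2(n − 1)) machines (our reading of the steps above, which reproduces the 5.5M factor for n = 5), and the names `complete` and `complement` as well as the toy sample are hypothetical.

```python
from collections import Counter

def complement(s: str) -> str:
    """Swap 0s and 1s."""
    return s.translate(str.maketrans("01", "10"))

def complete(counts: Counter, nonhalting: int, n: int):
    """Extend results of M reduced-space runs (blank 0) to the full space.

    Assumes M = halting + non-halting runs is divisible by 2(n - 1).
    Returns (completed string counts, completed non-halting count).
    """
    m = sum(counts.values()) + nonhalting
    per_start, rest = divmod(m, 2 * (n - 1))
    if rest:
        raise ValueError("M must be a multiple of 2(n - 1)")

    out = Counter()
    for s, c in counts.items():      # step 1: mirror the left-moving starts
        out[s] += c
        out[s[::-1]] += c
    nonhalting *= 2
    out["0"] += per_start            # step 2: start -> halt, writing 0 or 1
    out["1"] += per_start
    nonhalting += 4 * per_start      # step 3: start -> start never halts

    final = Counter()
    for s, c in out.items():         # blank symbol 1: complement everything
        final[s] += c
        final[complement(s)] += c
    return final, 2 * nonhalting

# Hypothetical toy sample: M = 8 runs with n = 5.
strings, loops = complete(Counter({"01": 3, "1": 2}), nonhalting=3, n=5)
total = sum(strings.values()) + loops
print(strings, loops, total)
```

With M = 8 and n = 5 the completed tally accounts for 44 machines, that is 5.5 times the number actually run.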

7.4. Detecting non-halting machines

It's useful to avoid running machines that we can easily verify will never
stop. These machines would consume the runtime without yielding an output.

The reduction in the enumeration that we've shown reduces the number 
of machines to be generated. Now we present some reductions that work after 
the machines are generated, in order to detect non-halting computations and
skip running them. Some of these were detected when filling the transition 
table, others at runtime. 

7.4.1. Machines without transitions to the halting state

While we're filling the transition table, if a certain transition goes to the 
halting state, we can activate a flag. If after completing the transition table 
the flag is not activated, we know that the machine won't stop. 

In our reduced enumeration there are $2(n-1)(4n)^{2n-1}$ machines of this
kind. In D(5) this is $4.096 \times 10^{12}$ machines. It represents 42.41% of the total
number of machines in the reduced enumeration.

The number of machines in the reduced enumeration that are not filtered
as non-halting when filling the transition table is 5,562,153,742,336. That is
504.73 times the total number of machines that fully produce D(4).
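
These figures follow directly from the counting formulas. A short Python check of the arithmetic (the function names are ours):

```python
def reduced_space(n: int) -> int:
    """Size of the reduced enumeration: 2(n - 1)(4n + 2)^(2n - 1)."""
    return 2 * (n - 1) * (4 * n + 2) ** (2 * n - 1)

def no_halting_transition(n: int) -> int:
    """Reduced-space machines in which no transition enters the halting state."""
    return 2 * (n - 1) * (4 * n) ** (2 * n - 1)

n = 5
filtered = no_halting_transition(n)
remaining = reduced_space(n) - filtered
share = 100 * filtered / reduced_space(n)      # percentage filtered out
vs_d4 = remaining / (4 * 4 + 2) ** (2 * 4)     # relative to the full (4,2) space
print(filtered, remaining, round(share, 2), round(vs_d4, 2))
```

The only change between the two formulas is the number of choices per remaining transition, 4n instead of 4n + 2, since the two halting options are excluded.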

7.4.2. Detecting escapees

There should be a great number of escapees, that is, machines that run 
infinitely in the same direction over the tape. 

Some kinds are simple to check in the simulator. We can use a counter
that indicates the number of consecutive not-previously-visited tape positions
that the machine visits. If the counter exceeds the number of states, then
we have found a loop that will repeat infinitely. To justify this, suppose
that at some stage the machine is visiting a certain tape position
for the first time, moving in a specific direction (the direction that points
toward new cells). If the machine continues moving in the same direction for
n + 1 steps, and thus reading blank symbols, then it has repeated some state
s in two transitions. As it's always reading (not previously visited) blank
symbols, the machine has repeated the transition for (s, b) twice, b being the
blank symbol. But the behavior is deterministic, so if the machine has used
the transition for (s, b) and, after some steps in the same direction visiting
blank cells, has repeated the same transition, it will continue doing so
forever, because it will always find the same symbols.

There is a special case in which this filter applies immediately: if the
symbol read is a blank one not previously visited, the shift is in the direction
of new cells and there is no modification of state. In fact this would be
deemed an escapee, because the machine runs for n + 1 new positions over
the tape. But it's an escapee that's especially simple to detect, in just one
step and not n + 1. We call the machines detected by this simple filter "short
escapees", to distinguish them from other, more general escapees.
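
The counter-based filter can be sketched as follows. This Python fragment is an illustration under assumptions, not the authors' C++ simulator: the transition-table encoding, the function name `classify` and the example machine are ours.

```python
from typing import Dict, Tuple

# (state, read symbol) -> (symbol to write, head move, next state);
# states are 1..n and the halting state is 0.
Machine = Dict[Tuple[int, int], Tuple[int, int, int]]

def classify(machine: Machine, n: int, max_steps: int, blank: int = 0) -> str:
    """Return 'halts', 'escapee' or 'undecided' for a run from a blank tape."""
    tape: Dict[int, int] = {}
    pos, state = 0, 1
    lo = hi = 0          # range of cells visited so far
    fresh = 0            # consecutive moves onto never-visited cells
    for _ in range(max_steps):
        write, move, state = machine[(state, tape.get(pos, blank))]
        tape[pos] = write
        if state == 0:
            return "halts"
        pos += move
        if pos < lo or pos > hi:
            fresh += 1
            lo, hi = min(lo, pos), max(hi, pos)
            if fresh > n:            # some state repeated on blank cells
                return "escapee"
        else:
            fresh = 0
    return "undecided"

# Hypothetical 2-state machine that alternates states while moving right.
runner: Machine = {
    (1, 0): (1, 1, 2), (1, 1): (1, 0, 0),
    (2, 0): (1, 1, 1), (2, 1): (1, 0, 0),
}
verdict = classify(runner, n=2, max_steps=100)
print(verdict)
```

The counter is reset whenever the head returns to an already visited cell, so a halting machine that merely extends its tape a few cells at a time is not misclassified.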

7.4.3. Detecting cycles

We can detect cycles of period two. They are produced when in steps s
and s + 2 the tape is identical and the machine is in the same state and the
same position. When this is the case, the cycle will be repeated infinitely.
To detect it, we have to anticipate the following transition that will apply in
some cases. In a cycle of period two, the machine cannot change any symbol
on the tape, because if it did, the tape would be different after two steps.
Then the filter would be activated when there is a transition that doesn't
change the tape, for instance

$\{s, k\} \to \{s', k, d\}$

where $d \in \{-1, 1\}$ is some direction (left, right) and the head is at position
i on tape t, which is to say, reading the symbol t[i]. Then, there is a cycle of
period two if and only if the transition that corresponds to $\{s', t[i+d]\}$ is

$\{s', t[i+d]\} \to \{s, t[i+d], -d\}$
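
The two-step condition translates into a direct look-ahead on the transition table. The Python sketch below assumes the same dictionary encoding of machines used for illustration elsewhere, (state, symbol) -> (write, move, next state); the function name and the example machine are hypothetical.

```python
from typing import Dict, Tuple

Machine = Dict[Tuple[int, int], Tuple[int, int, int]]

def starts_period_two_cycle(machine: Machine, state: int,
                            tape: Dict[int, int], pos: int,
                            blank: int = 0) -> bool:
    """Check the period-two condition at the current configuration."""
    k = tape.get(pos, blank)
    write, d, nxt = machine[(state, k)]
    if write != k or d not in (-1, 1):
        return False                  # first step must leave the tape intact
    k2 = tape.get(pos + d, blank)
    if (nxt, k2) not in machine:
        return False                  # e.g. nxt is the halting state
    write2, d2, back = machine[(nxt, k2)]
    # Second step must also leave the tape intact, undo the move
    # and return to the original state.
    return write2 == k2 and d2 == -d and back == state

# Hypothetical machine bouncing between two blank cells forever.
bouncer: Machine = {
    (1, 0): (0, 1, 2), (1, 1): (1, 0, 0),
    (2, 0): (0, -1, 1), (2, 1): (1, 0, 0),
}
print(starts_period_two_cycle(bouncer, state=1, tape={}, pos=0))
```

Because the check only inspects two table entries and two tape cells, it can be applied at every step of a simulation at negligible cost.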

7.5. Checking the filters

We calculated D(4) with and without all the filters as was done in [12],
arriving at exactly the same results with as without, and thereby validating
our reduction techniques.

Running D(4) without reducing the number of machines or detecting
non-halting ones took 952 minutes. Running the reduced enumeration with
non-halting detectors took 226 minutes.

We filtered the following non-halting machines:

Filter                                              Number of TMs
machines without transitions to the halting state   1 610 612 736
short escapees                                        464 009 712
other escapees                                        336 027 900
cycles of period two                                   15 413 112
machines that consume all the runtime                 366 784 524
Total                                               2 792 847 984

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8. A glance at D(5) 



Running all the Turing machines with 5 states in the reduced enumeration
(see Section 7.2) up to 500 steps for the calculation of D(5) took 18 days using 25
x86-64 CPUs running at 2128 MHz with 4 GB of memory each (see footnote 4). In order to
save space in the output of D(5), our C++ simulator produced partial results
every $10^9$ consecutive machines according to the enumeration constructed as
described in Section 7. Every $10^9$ machines, the counters for each string
produced were updated. The final classification is only 4.1 Megabytes, but
we estimate that, had we not produced partial results, the output would have
been on the order of 1.28 Terabytes for the reduced space and 6.23 Terabytes
for the full one. If we were to include in the output an indication for
non-halting machines, the files would grow an extra 1.69 Terabytes for the reduced
enumeration and 8.94 Terabytes for the full one.



Table 1 provides a glance at D(5) showing the 147 most frequent (and
therefore simplest) calculated strings out of 99 608. The top strings of D(5)
conform to an intuition of simplicity. Table 2 shows all the $2^n$ strings for
n = 7, hence displaying what D(5) suggests are the strings sorted from
lowest to highest complexity, which seems to agree well with the intuition of
simple (from top left) to random-looking (bottom right).



9. Reliability of the approximation of D(5) 

Not all 5-state Turing machines have been used to build D(5), since only
the output of machines that halted at or before 500 steps was taken into
consideration. As an experiment to see how many machines we were leaving
out, we ran $1.23 \times 10^{10}$ Turing machines for up to 5000 steps (see Fig. 3).
Among these, only 50 machines halted after 500 steps and before 5000 (that
is less than $1.75164 \times 10^{-8}$ because in the reduced enumeration we don't
include those machines that halt in one step or that we know won't halt before
generating them, so it's a smaller fraction), with the remaining 1 496 491 379
machines not halting at 5000 steps. As far as these are concerned, and given
the unknown values for the Busy Beavers for 5 states, we do not know after
how many steps they would eventually halt, if they ever do. According to the
following analysis, our choice of a runtime of 500 steps therefore provides a
good estimation of D(5).

4 A supercomputer located at the Centro Informático Científico de Andalucía (CICA),
Spain.

Table 1: The 147 most frequent strings from D(5) (by row). The first column is a counter
to help locate the rank of each string.

  1   1        0        11       10       01       00       111
  8   000      110      100      011      001      101      010
 15   1111     0000     1110     1000     0111     0001     1101
 22   1011     0100     0010     1010     0101     1100     0011
 29   1001     0110     11111    00000    11110    10000    01111
 36   00001    11101    10111    01000    00010    11011    00100
 43   10110    10010    01101    01001    10101    01010    11010
 50   10100    01011    00101    11100    11000    00111    00011
 57   11001    10011    01100    00110    10001    01110    111111
 64   000000   111110   100000   011111   000001   111101   101111
 71   010000   000010   101010   010101   101101   010010   111011
 78   110111   001000   000100   110101   101011   010100   001010
 85   101001   100101   011010   010110   110110   100100   011011
 92   001001   111100   110000   001111   000011   101110   100010
 99   011101   010001   110010   101100   010011   001101   111001
106   100111   011000   000110   111010   101000   010111   000101
113   100110   011001   110011   001100   100001   011110   110100
120   001011   111000   000111   110001   100011   011100   001110
127   1111111  0000000  1111110  1000000  0111111  0000001  1010101
134   0101010  1111101  1011111  0100000  0000010  1111011  1101111
141   0010000  0000100  1110111  0001000  1111100  1100000  0011111

The frequency of runtimes of (halting) Turing machines has theoretically
been proven to drop exponentially [3], and our experiments are close to the
theoretical behavior (see Fig. 3). To estimate the fraction of halting machines
that were missed because Turing machines with 5 states were stopped after
500 steps, we hypothesize that the number of steps S a random halting
machine needs before halting is an exponential RV (random variable), defined
by $\forall k \geq 1,\ P(S = k) \propto e^{-\lambda k}$. We do not have direct access to an evaluation
of $P(S = k)$, since we only have data for those machines for which $S \leq 5000$.
But we may compute an approximation of $P(S = k \mid S \leq 5000)$, $1 \leq k \leq 5000$,
which is proportional to the desired distribution.

A non-linear regression using ordinary least-squares gives the approximation
$P(S = k \mid S \leq 5000) = \alpha e^{-\lambda k}$ with $\alpha = 1.12$ and $\lambda = 0.793$. The residual
sum-of-squares is $3.392 \times 10^{-3}$, the number of iterations 9 with starting values
$\alpha = 0.4$ and $\lambda = 0.25$. Fig. 4 helps to visualize how the model fits the data.

Table 2: All the $2^n$ strings for n = 7 from D(5) sorted from highest frequency (hence
lowest complexity) to lowest frequency (hence highest (random) complexity). Strings in
each row have the same frequency (hence the same Kolmogorov complexity). There are
31 different groups representing the different complexities of the $2^7 = 128$ strings.

 1   1111111  0000000
 2   1111110  1000000  0111111  0000001
 3   1010101  0101010
 4   1111101  1011111  0100000  0000010
 5   1111011  1101111  0010000  0000100
 6   1110111  0001000
 7   1111100  1100000  0011111  0000011
 8   1011010  1010010  0101101  0100101
 9   1101101  1011011  0100100  0010010  1111001  1001111  0110000  0000110
10   1110101  1010111  0101000  0001010
11   1101110  1000100  0111011  0010001
12   1101010  1010100  0101011  0010101
13   1010110  1001010  0110101  0101001
14   1111010  1010000  0101111  0000101
15   1110110  1001000  0110111  0001001
16   1010001  1000101  0111010  0101110
17   1011110  1000010  0111101  0100001
18   1011101  0100010
19   1101011  0010100  1001001  0110110
20   1110011  1100111  0011000  0001100
21   1100101  1010011  0101100  0011010
22   1011001  1001101  0110010  0100110  1000001  0111110
23   1111000  1110000  0001111  0000111  1101001  1001011  0110100  0010110
24   1110010  1011000  0100111  0001101  1101100  1100100  0011011  0010011
25   1100010  1011100  0100011  0011101
26   1100110  1001100  0110011  0011001
27   1001110  1000110  0111001  0110001
28   1100001  1000011  0111100  0011110
29   1110001  1000111  0111000  0001110
30   1100011  0011100
31   1110100  1101000  0010111  0001011

Figure 3 (panels shown here: (c) D(4), (d) D(5)): Distribution of runtimes from D(2) to D(5).
On the y-axes are the number of Turing machines and on the x-axes the number of steps upon
halting. Note that, because no Busy Beaver values are known for 5 states, D(5) (panel d) was
produced by all the Turing machines with 5 states that run for at most t = 500 steps. These
plots suggest that the runtime cutoff t = 500 for the production of D(5) covers most of the
halting Turing machines, hence the missed machines are negligible.

The model's $\lambda$ is the same $\lambda$ appearing in the general law $P(S = k)$, and
may be used to estimate the number of machines we lose by using a 500 step
cut-off point for running time: $P(S > 500) \approx e^{-500\lambda} \approx 6 \times 10^{-173}$. This
estimate is far below the point where it could seriously impair our results:
the least probable (non-impossible) string according to D(5) has an observed
probability of $1.13 \times 10^{-9}$.
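
The order of magnitude of this tail estimate can be checked with a short computation. The sketch below only reuses the fitted decay rate 0.793 and the 500-step cutoff quoted in the text; the function name `log10_tail` is a hypothetical helper, and working with base-10 logarithms is a convenience chosen here so that the result is read off as mantissa and exponent.

```python
import math

LAMBDA = 0.793   # decay rate of the fitted exponential model (from the text)
CUTOFF = 500     # runtime cutoff in steps (from the text)


def log10_tail(lam: float, cutoff: int) -> float:
    """Return log10 of exp(-lam * cutoff)."""
    return -lam * cutoff / math.log(10)


exponent10 = log10_tail(LAMBDA, CUTOFF)
magnitude = math.floor(exponent10)          # power of ten
mantissa = 10 ** (exponent10 - magnitude)   # leading factor in [1, 10)
print(f"exp(-{CUTOFF} * {LAMBDA}) is about {mantissa:.1f}e{magnitude}")
```

The result is more than 160 orders of magnitude below the smallest observed string probability, which is the comparison the argument relies on.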

Although this is only an estimate, it suggests that missed machines are
few enough to be considered negligible.

Figure 4: Observed (solid) and theoretical (dotted) $P(S = k \mid S \leq 5000)$ against k
(running time). The x-axis is logarithmic. Two different scales are used on the y-axis
to allow for a more precise visualization.

10. Features of D(5)

10.1. Lengths 

5-state Turing machines produced 99,608 different binary strings (to be
compared to the 1832 strings for D(4)). While the largest string produced for
D(4) was of length 16 bits and all $2^n$ strings were produced only up to n = 8,
the strings in D(5) have lengths from 1 to 49 bits (excluding lengths 42
and 46 that never occur) and include every possible string of length l < 12.
Among the 12 bit strings, only two were not produced (000110100111 and
111001011000). For n = 13, ..., 15, about half of the $2^n$ strings were produced
(and therefore have frequency and complexity values). Fig. 5 shows the
proportion of n-long strings appearing in D(5) outputs, for $n \in \{1, \ldots, 49\}$.

Figure 5: Proportion of all n-long strings appearing in D(5) against n (plot annotation:
n = 12, 2 strings missing).

The cumulative probability of every n-long string gives a probability law
on $\mathbb{N}^*$. Fig. 6 shows such a law obtained with D(5), with D(4), and with
the theoretical $2^{-n}$ appearing in Levin's semi-measure. The most important
difference may be the fact that this law does not decrease for D(5), since
length 2 is more likely than length 1.

10.2. Global simplicity 

Some binary sequences may seem simple from a global point of view
because they show symmetry (1011 1101) or repetition (1011 1011). Let us
consider the string s = 1011 as an example. We have $P_{D(5)}(s) = 3.267414 \times 10^{-3}$.
The repetition ss = 10111011 has a much lower probability
$P_{D(5)}(ss) = 4.645999 \times 10^{-7}$. This is not surprising considering the fact that ss is much
longer than s, but we may then wish to consider other strings based on s. In
what follows, we will consider three methods (repetition, symmetrization,
0-complementation). The repetition of s is ss = 10111011, the "symmetrized"
string is 10111101, and the 0-complementation is 10110000. These three strings of
identical length have different probabilities ($4.645999 \times 10^{-7}$, $5.335785 \times 10^{-7}$
and $3.649934 \times 10^{-7}$ respectively).
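
The three constructions can be written down directly. The sketch below is an illustration only; the function names are hypothetical, and the definitions are inferred from the worked example for s = 1011 (append a copy, append the reversal, append as many zeros as the string has bits).

```python
def repetition(s: str) -> str:
    """Append a copy of s to itself."""
    return s + s


def symmetrization(s: str) -> str:
    """Append the reversal of s, giving a palindrome of twice the length."""
    return s + s[::-1]


def zero_complementation(s: str) -> str:
    """Pad s with zeros up to twice its length."""
    return s + "0" * len(s)


s = "1011"
print(repetition(s), symmetrization(s), zero_complementation(s))
```

All three outputs have length 2n for an input of length n, which is what allows their probabilities to be compared against the other 2n-long strings.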

Figure 6: Cumulative probability of all n-long strings against n (legend: D(4), D(5) and
$1/2^n$; x-axis: length).

Let us now consider all strings of length 3 to 6, and their symmetrization,
0-complementation and repetition. Fig. 7 is a visual presentation of
the results. In each case, even the smallest of the three means (of the
symmetrized, complemented and repeated patterns; dotted horizontal line)
lies in the upper tail of the D(5) distribution for 2n-length strings. And this
is even more obvious with longer strings. Symmetry, complementation and
repetition are, on average, recognized by D(5).

Another method for finding "simple" sequences is based on the fact that
the length of a string is negatively linked to its probability of appearing
in D(5). When ordered by decreasing probability, strings show increasing
lengths. Let's call those sequences for which length is greater than that of
the next string "climbers". The first 50 climbers appearing in D(5) are given
in Table 3 and show subjectively simple patterns, as expected.
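
As a minimal sketch of this definition, the function below scans a list of strings already sorted by decreasing probability and keeps those that are longer than their immediate successor. The function name `climbers` and the toy ranking are hypothetical and are not taken from D(5).

```python
from typing import List


def climbers(ranked: List[str]) -> List[str]:
    """Strings that are longer than the string ranked immediately after them.

    `ranked` is assumed to be sorted by decreasing probability.
    """
    return [cur for cur, nxt in zip(ranked, ranked[1:]) if len(cur) > len(nxt)]


# Toy ranking, invented for illustration only.
toy_ranking = ["0", "1", "00", "000", "01", "0000", "11"]
print(climbers(toy_ranking))
```

A climber is therefore a string that appears earlier in the ranking than its length alone would suggest.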

Strings are not sorted by length but follow an interesting distribution of 
length differences that agrees with our intuition of simplicity and randomness 
and is in keeping with the expectation from an approximation to m(s) and 
therefore K(s). 

10.3. Binomial behavior 

In a random binary string of length n, the number of '0s' conforms to a
binomial law given by $P(k) = 2^{-n} \binom{n}{k}$. On the other hand, if a random Turing
machine is drawn, simpler patterns are more likely to appear. Therefore, the
distribution arising from Turing machines should be more scattered, since
most simple patterns are often unbalanced (such as 0000000). This is indeed
what Fig. 8 shows: compared to truly random sequences of length n, D(5)
yields a larger standard deviation.

Figure 7: Mean ± standard deviation of D(5) of 2n-long strings given by process of
symmetrization (Sym), 0-complementation (Comp) and repetition (Rep) of all n-long strings.
The dotted horizontal line shows the minimum mean among Sym, Comp and Rep. The
density of D(5) (smoothed with Gaussian kernel) for all 2n-long strings is given in the
right-margin.

10.4. A Bayesian approach

D(5) allows us to determine, using a Bayesian approach, the probability
that a given sequence is random. Let s be a sequence of length l. This
sequence may be produced by a machine, let's say a 5-state Turing machine
(event M), or by a random process (event R). Let's set the prior probability
at $P(R) = P(M) = 1/2$. Because s does not have a fixed length, we cannot
use the usual probability $P(s) = 1/2^l$, but we may, following Levin's idea, use
$P(s \mid R) = 1/2^{2l}$. Given s, we can compute

$$P(R \mid s) = \frac{P(s \mid R)\, P(R)}{P(s)}$$

with

$$P(s) = P(s \mid M)\, P(M) + P(s \mid R)\, P(R) = \frac{P_{D(5)}(s)}{2} + \frac{1}{2^{2l+1}}.$$

Since $P(R) = 1/2$ and $P(s \mid R) = 1/2^{2l}$, the formula becomes

$$P(R \mid s) = \frac{1}{2^{2l} P_{D(5)}(s) + 1}.$$

Table 3: Minimal examples of emergence: the first 50 climbers.

00000000        000000000       000000001       000010000       010101010
000000010       000000100       0000000000      0101010101      0000001010
0010101010      00000000000     0000000010      0000011010      0100010001
0000001000      0000101010      01010101010     0000000011      0101010110
0000000100      0000010101      000000000000    0000110000      0000110101
0000000110      0110110110      00000010000     0000001001      00000000001
0010101101      0101001001      0000011000      00010101010     01010010101
0010000001      00000100000     00101010101     00000000010     00000110000
00000000100     01000101010     01010101001     01001001001     010101010101
01001010010     000000000001    00000011000     00000000101     0000000000000

There are 16 strings s such that $P(R \mid s) < 10^{-16}$ (the "least random
strings"). Their lengths lie in the range [47, 49]. An example is:
11101110111011101110111011101110111011111010101. The fact that the "least random"
strings are long can intuitively be deemed correct: a sequence must be
long before we can be certain it is not random. A simple sequence such
as 00000000000000000000 (twenty '0s') gives a $P(R \mid s) = 0.006$. The size
does matter, and there is no sequence of 40 or more '0s' in D(5).
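
The posterior can be evaluated with a one-line function. The sketch below is an illustration under stated assumptions: it takes equal priors P(R) = P(M) = 1/2 and assumes that the random model assigns probability 2^(-2l) to a string of length l, so that the posterior reduces to 1 / (2^(2l) P_D(5)(s) + 1). The function name and the input values are hypothetical and are not taken from D(5).

```python
def prob_random_given_s(length: int, p_d5: float) -> float:
    """Posterior P(R|s) with equal priors and P(s|R) = 2**(-2 * length) (assumed)."""
    if length < 1 or p_d5 < 0:
        raise ValueError("length must be >= 1 and p_d5 must be non-negative")
    return 1.0 / (2 ** (2 * length) * p_d5 + 1.0)


# Hypothetical inputs: a 1-bit string with machine probability 0.25, and a
# string that no machine produced (probability 0).
print(prob_random_given_s(1, 0.25))
print(prob_random_given_s(12, 0.0))
```

Because the factor 2^(2l) grows quickly with the length, a long string needs only a tiny machine probability to be classified as non-random, which matches the observation that the "least random" strings are the long ones.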

A total of 192 strings achieve a $P(R \mid s) > 1 - 1.7 \times 10^{-4}$. They are all
of length 12 or 13. Examples are the strings 1110100001110, 1101110000110
or 1100101101000. This is consistent with our idea of a random sequence.
However, the fact that only lengths 12 and 13 appear here may be due to
the specificity of D(5).

Figure 8: Distributions of the number of zeros in n-long binary sequences according to a
truly random drawing (red, dotted), or a D(5) drawing (black, solid) for length 4 to 12.

11. Comparing D(4) and D(5)

Every 4-state Turing machine may be modeled by a 5-state Turing machine
whose fifth state is never attained. Therefore, the 1832 strings produced
by D(4) calculated in [12] also appear in D(5). We thus have 1832 ranked
elements in D(4) to compare with. The basic idea at the root of this work
is that D(5) is a refinement (and major extension) of D(4), previously
calculated in an attempt to understand and evaluate algorithmic complexity.
This would be hopeless if D(4) and D(5) led to totally different measures
and rankings of simplicity versus complexity (randomness).

11.1. Agreement in probability

Figure 9: D(5) against D(4), for n-long strings

The link between D(4) and D(5) seen as measures of simplicity may be
measured by the determination coefficient $r^2$, r being the Pearson correlation
coefficient. This coefficient is $r^2 = 99.23\%$, which may be understood as
"D(4) explains 99.23% of the variations of D(5)". The scatterplot in Fig. 9
displays D(5)(s) against D(4)(s) for all strings s of length n = 3, ..., 8 (8
being the largest integer l such that D(4) comprises every l-long sequence).
The agreement between D(5) and D(4) is almost perfect, but there are
still some differences. Possible outliers may be found using a studentized
residual in the linear regression of D(5) against D(4). The only strings
giving absolute studentized residuals above 20 are 0 and 1. The only strings
giving absolute studentized residuals lying between 5 and 20 are all the 3-long
strings. All 4-long strings fall between 2 and 5. This shows that the
differences between D(5) and D(4) may be explained by the relative importance
given to the diverse lengths, as shown above (Fig. 6).

11.2. Agreement in rank 




Figure 10: R5 (rank according to D(5)) against R4. The grayscale indicates the length
of the strings: the darker the point, the shorter the string.

There are some discrepancies between D(5) and D(4) due to length effects.
Another way of studying the relationship between the two measures
is to turn our attention to ranks arising from D(5) and D(4). The Spearman
coefficient is an efficient tool for comparing ranks. Each string may be
associated with a rank according to decreasing values of D(5) (R5) or D(4)
(R4). A greater rank means that the string is less probable. Fig. 10 displays
a scatterplot of ranks according to D(5) as a function of D(4)-rank. Visual
inspection shows that the ranks are similar, especially for shorter sequences.
The Spearman correlation coefficient amounts to 0.9305, indicating strong
agreement.
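
The rank comparison can be sketched as follows. This is an illustration only: it assumes that tied strings receive the mean of the positions they occupy (which would account for half-integer ranks), computes the Spearman coefficient as the Pearson correlation of the two rank vectors, and uses invented toy frequencies rather than values from D(4) or D(5). The function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks by decreasing value; tied values share their mean position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(p: Sequence[float], q: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the average ranks."""
    return pearson(average_ranks(p), average_ranks(q))


# Toy frequencies for four strings under two distributions (invented).
coarse = [0.5, 0.2, 0.2, 0.1]   # two strings tied
fine = [0.4, 0.3, 0.2, 0.1]     # the tie is resolved
print(average_ranks(coarse), round(spearman(coarse, fine), 4))
```

In the toy data the coarser distribution cannot separate two strings that the finer one distinguishes, the same kind of effect discussed for the ties of D(4).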

Not all strings are equally ranked and it may be interesting to take a closer
look at outliers. Table 4 shows the 20 strings for which $|R_4 - R_5| > 600$. All
these sequences are ties in D(4), whereas D(5) distinguishes 5 groups. Each
group is made up of 4 equivalent strings formed from simple transformations
(reversing and complementation). This confirms that D(5) is fine-grained
compared to D(4).

The shortest sequences such that $|R_4 - R_5| > 5$ are of length 6. Some of
them show an intriguing pattern, with an inversion in ranks, such as 000100
($R_5 = 85$, $R_4 = 77$) and 101001 with reversed ranks.

Table 4: The 20 strings for which $|R_4 - R_5| > 600$

sequence      R4        R5
010111110     1625.5    837.5
011111010     1625.5    837.5
100000101     1625.5    837.5
101000001     1625.5    837.5
000011001     1625.5    889.5
011001111     1625.5    889.5
100110000     1625.5    889.5
111100110     1625.5    889.5
001111101     1625.5    963.5
010000011     1625.5    963.5
101111100     1625.5    963.5
110000010     1625.5    963.5
0101010110    1625.5    1001.5
0110101010    1625.5    1001.5
1001010101    1625.5    1001.5
1010101001    1625.5    1001.5
0000000100    1625.5    1013.5
0010000000    1625.5    1013.5
1101111111    1625.5    1013.5
1111111011    1625.5    1013.5


On the whole, D(5) and D(4) are similar measures of simplicity, both from
a measurement point of view and a ranking point of view. Some differences
may arise from the fact that D(5) is more fine-grained than D(4). Other
unexpected discrepancies still remain: we must be aware that D(5) and D(4)
are both approximations of a more general limit measure of simplicity versus
randomness. Differences are inevitable, but the discrepancies are rare enough
to allow us to hope that D(5) is for the most part a good approximation of
these properties.

12. Kolmogorov complexity calculation 

It is now straightforward to apply the Coding theorem (Eq. 3)
to convert string frequency (as a numerical approximation of algorithmic
probability m(s)) to an exact evaluation of Kolmogorov complexity $K_{D(n)}(s)$
(see Tables 5 and 6). Formally,

$$K_{D(n)}(s) = -\log_2 D(n)(s)$$

First it is worth noting that the calculated complexity values in Tables 5
and 6 are real numbers, when they are supposed to be the lengths of programs
(in bits) that produce the strings, hence integers. The obvious thing to do is
to round the values to the nearest integer, but this would have a negative
impact as the decimal expansion provides a finer classification. Hence the
finer structure of the classification is favored over the exact interpretation of
the values as lengths of computer programs. It is also worth mentioning that
the lengths of the strings (as shown in Table 6) are almost always smaller than
their Kolmogorov (program-size) values, which is somewhat to be expected
from this approach. Consider the single bit. It not only encodes itself, but
the length of the string (1 bit) as well, because it is produced by a Turing
machine that has reached the halting state and produced this output upon
halting.
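
The conversion from frequency to complexity is a single logarithm. The sketch below applies it to the two highest frequencies listed in Table 5; the function name `complexity_from_frequency` is a hypothetical helper introduced for illustration.

```python
import math


def complexity_from_frequency(frequency: float) -> float:
    """Coding-theorem estimate K(s) = -log2 of the output frequency of s."""
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    return -math.log2(frequency)


# Frequencies of the strings "1" and "11" as listed in Table 5.
for freq in (0.175036, 0.0996187):
    print(freq, round(complexity_from_frequency(freq), 5))
```

Keeping the fractional part, rather than rounding to an integer number of bits, is what preserves the finer classification discussed above.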

Consider the Shannon entropy of a single bit calculated from D(5). Given
that D(5) is not uniformly distributed (just as m(s) is not, by definition),
receiving a single bit does not carry only the information of the bit in
question, because the chances of not getting any other bit are exponentially larger
than getting one. This is similar to the way the occurrence of a particular letter
in English determines with some probability the next letter (e.g. 3 "e"s in a row
never normally occur in English, thus "e" brings with it more information than
just "e"). In the same way we think that strings produced by halting machines
also encode information about their length.

Table 5: Top 20 strings in D(5) with highest frequency and therefore lowest Kolmogorov
(program-size) complexity. From frequency (middle column) to complexity (extreme right
column), applying the coding theorem in order to get $K_{D(5)}$, which we will call $K_m(s)$, as our
current best approximation to an experimental m(s), that is D(5), through the Coding
theorem.

sequence    frequency (m(s))    complexity (K_m(s))
1           0.175036            2.51428
0           0.175036            2.51428
11          0.0996187           3.32744
10          0.0996187           3.32744
01          0.0996187           3.32744
00          0.0996187           3.32744
111         0.0237456           5.3962
000         0.0237456           5.3962
110         0.0229434           5.44578
100         0.0229434           5.44578
011         0.0229434           5.44578
001         0.0229434           5.44578
101         0.0220148           5.50538
010         0.0220148           5.50538
1111        0.0040981           7.93083
0000        0.0040981           7.93083
1110        0.00343136          8.187
1000        0.00343136          8.187
0111        0.00343136          8.187
0001        0.00343136          8.187



Also worth noting is the fact that the strings 00, 01, 10 and 11 all have
the same complexity, according to our calculations (this is the case from D(2)
to D(5)). It might just be the case that the strings are too short to really
have different complexities, and that a Turing machine that can produce
one or the other is of exactly the same length. To us the string 00 may
look simpler than 01, but we do not have many arguments to validate
this intuition for such short strings, and it may be an indication that such
intuition is misguided (think of natural language: if spelled out in words, 00
doesn't seem to have a much shorter description than the shortest description
of 01).

Compare this phenomenon of program-sizes being greater than the length
of these short strings to the extent of the problem posed by compression
algorithms (see Fig. 1), which collapse all strings of up to length 40 at least,
producing the same complexity approximations for all of them. One way to
overcome this minor inconvenience involved in using the alternative approach
developed here is to subtract a constant (no greater than the smallest
complexity value) from all the complexity values, which gives these strings lower
absolute random complexity values (preserving the relative order). But even
if left "random", this alternative technique can be used to distinguish and
compare them, unlike the lossless compression approach that is unable to
further compress short strings.



Table 6: 20 random strings (sorted from lowest to highest complexity values) from the
first half of D(5) to which the coding theorem has been applied (extreme right column)
to approximate K(s).

string length    string                           complexity (K_m(s))
11               11011011010                      28.1839
12               101101110011                     32.1101
12               110101001000                     32.1816
13               0101010000010                    32.8155
14               11111111100010                   34.1572
12               011100100011                     34.6045
15               001000010101010                  35.2569
16               0101100000000000                 35.6047
13               0110011101101                    35.8943
15               101011000100010                  35.8943
16               1111101010111111                 25.1313
18               000000000101000000               36.2568
15               001010010000000                  36.7423
15               101011000001100                  36.7423
17               10010011010010011                37.0641
21               100110000000110111011            37.0641
14               11000010000101                   37.0641
17               01010000101101101                37.4792
29               01011101111100011101111010101    37.4792
14               11111110011110                   37.4792

The phenomenon of complexity values greater than the lengths of the
strings is transitional. Out of the 99,608 strings in D(5), 212 have greater
string lengths than program-size values. The first string to have a smaller
program-size value than string length is the string
10101010101010101010101010101010101010101 (and its complementation), of length 41 but
program-size of 33.11 (34 if rounded). The mean of the strings with greater
program-size than length is 38.3494. The strings with the greatest difference between
length and program-size in D(5) are strings of low Kolmogorov complexity
such as 0101010001000100010001000100010001000100010001010, of length
49 but with an approximated Kolmogorov complexity (program-size) value
of 39.0642. Hence they are far from random, both in terms of the measure and in
terms of the string's appearance.

12.1. Randomness in D(5) 

Paradoxically, the strings at the bottom of D(5) as sorted from highest
to lowest frequency and therefore lowest to highest Kolmogorov (random)
complexity are not very random looking, but this is to be expected, as the
actual most random strings of these lengths would have had very low
frequencies and would not therefore have been produced. In fact what we are
looking at are some of the strings with the greatest structure (lowest
Kolmogorov complexity) that made it into D(5) (most of them produced by a
single Turing machine). Table 7, however, shows the bottom of the length
n = 12 classification extracted from D(5), for which all $2^n$ binary strings
were produced, hence displaying more apparent randomness.



Table 7: Bottom 21 strings of length n = 12 with smallest frequency in D(5).

100111000110    100101110001    100011101001
100011100001    100001110001    011110001110
011100011110    011100010110    011010001110
011000111001    000100110111    111000110100
110100111000    001011000111    000111001011
110100011100    110001110100    001110001011
001011100011    110000111100    001111000011

12.2. Robustness of $K_{D(n)}$

An important question is how robust $K_{D(n)}$ is, that is, how sensitive it is
to n. We know that the invariance theorem (Section 2) guarantees that the
values converge in the long term, but the invariance theorem tells nothing
about the rate of convergence. In Section 11.2 we have shown that D(n + 1)
respects the order of D(n) except for very few and minor value discrepancies
concerning the least frequent strings (and therefore the most unstable given
the few machines generating them). This is not obvious despite the fact
that all Turing machines with n states in (n, m) are included in the space of
(n + 1, m) machines (that is, the machines that never reach one of the n + 1
states), because the number of machines in (n + 1, m) far exceeds the
number of machines in (n, m), and a completely different result could then
have been produced. However, the agreement between D(n) and D(n + 1)
seems to be similarly high among, and despite, the few cases n < 6 in hand
to compare with. The only way for this behaviour to change radically for
n > 5 is if for some n', D(n') starts diverging in ranks from D(n' - 1)
before starting to converge again (by the invariance theorem). If one does
not have any reason to believe in such a change of behavior, the rate of rank
convergence of D(n) is close to optimal very soon, even for the relatively
"small" sets of Turing machines for small n.

One may ask how robust the complexity values and classifications may be 
in the face of changes in computational formalism (e.g. Turing machines with 
several tapes, and all possible variations). We have shown [23] that radical 
changes to the computing model produce reasonable (and correlated with 
different degrees of confidence) ranking distributions of complexity values 
(using even completely different computing models, such as unidimensional 
deterministic cellular automata and Post tag systems). 

We have also calculated the maximum differences between the Kolmogorov 
complexity evaluations of the strings occurring in every 2 distributions D(n) 
and D(n + 1) for n = 2, ... ,4. This provides estimations for the constant 
c in the invariance theorem (Eq. [2]) determining the maximum difference in 
bits among all the strings evaluated with one or another distribution, hence 
shedding light on the robustness of the evaluations under this procedure. 
The smaller the values of c the more stable our method. The values of these 
bounding constants (in bits) among the different exact evaluations of K us- 
ing D{n) for n = 2, ... ,5 after application of the Coding theorem (Eq. [3]) 
are: 
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\K D(2 )(s) - K D{3) (s)\ < c = 4.090; 4.090; 3.448; 0.39 
\K D{3) (s) - K Dii) (s) \ <c = 4.10; 3.234; 2.327; 2.327 
\K D{4) (s) - K D{5) {s)\ < c = 5.022; 4.274; 3.40; 2.797 

Here K_{D(n)}(s) means K(s) evaluated using the output frequency 
distribution D(n) after application of the Coding theorem (Eq. [3]) for 
n = 2, ..., 5 (n = 1 is a trivial, uninteresting case), and every value of c 
is calculated by quartiles (separated by semicolons): first among all the 
strings in the 2 compared distributions, then among the top 3/4, then the top 
half and finally the top quarter by rank. Notice that the estimation of c 
between D(2) and D(3), and between D(3) and D(4), remained almost the same 
among all strings occurring in both, at about 4 bits. This means one could 
write a "compiler" (or translator) of size only about 4 bits between the two 
distributions that, for all the strings occurring in both, provides one or 
the other complexity value for K based on one or the other distribution. The 
differences are considerably smaller for more stable strings (towards the top 
of the distributions). One may think that, given that the strings with their 
occurrences in D(n + 1) necessarily contain those in D(n) for all n (because 
the space of all Turing machines with an additional state always contains the 
computations of the Turing machines with fewer states), the agreement should 
be expected. However, D(n) contributes only about log of the number of 
strings in D(n + 1). For example, D(4) contributes only 1832 strings to the 
99 608 produced in D(5) (that is, less than 2%). All in all, the largest 
difference found between D(4) and D(5) is only about 5 bits among all the 
strings occurring in both D(4) and D(5) (1832 strings), where the values of K 
in D(4) are between 2.285 and 29.9. 
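
The quartile procedure for estimating c can be sketched as follows. This is 
a minimal illustration, not the code used to produce the values reported 
here: the function names are hypothetical, the two distributions are toy 
numbers rather than D(n) data, and ranking the common strings by their 
frequency in the larger distribution is an assumption. 

```python
from math import log2


def coding_theorem_k(dist):
    """Map each string's frequency to a complexity estimate, K = -log2(frequency)."""
    return {s: -log2(p) for s, p in dist.items()}


def quartile_constants(dist_a, dist_b):
    """Largest |K_a(s) - K_b(s)| over the strings common to both distributions,
    computed for the top 4/4, 3/4, 2/4 and 1/4 of those strings.

    Assumption: strings are ranked by decreasing frequency in dist_b,
    with ties broken alphabetically.
    """
    k_a, k_b = coding_theorem_k(dist_a), coding_theorem_k(dist_b)
    common = sorted((s for s in dist_a if s in dist_b),
                    key=lambda s: (-dist_b[s], s))
    constants = []
    for quarters in (4, 3, 2, 1):
        top = common[: max(1, len(common) * quarters // 4)]
        constants.append(max(abs(k_a[s] - k_b[s]) for s in top))
    return constants


# Toy distributions (illustrative only, not taken from D(n)).
d_small = {"0": 0.25, "1": 0.25,
           "00": 0.125, "01": 0.125, "10": 0.125, "11": 0.125}
d_large = {"0": 0.25, "1": 0.25, "00": 0.125, "11": 0.125,
           "01": 0.0625, "10": 0.0625, "000": 0.0625, "111": 0.0625}

c_by_quartile = quartile_constants(d_small, d_large)
print(c_by_quartile)
```

Because each subset is contained in the previous one, the four values can 
never increase from left to right, which matches the pattern of the constants 
listed above. 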

13. Conclusions 

Changing the programming language (or universal Turing machine) makes 
K(s) very unstable for short strings, which prevents one from getting a sense 
of the complexity of these strings. However, the concept of algorithmic 
probability and Levin's universal distribution m(s) varies very little in the 
production of strings, because it is the result of an operation that makes 
incremental changes to the frequencies from a very large number of 
calculations from a multitude of Turing machines. The chief advantage of 
evaluating K_m(s) rather than K(s) in general (for example, using an 
arbitrary universal Turing machine) is the following: m(s) may be as 
sensitive to the additive constant involved in the invariance theorem for 
different choices, but it seems more natural to calculate m(s) with an 
enumeration of Turing machines of increasing size (the traditional 
quasi-lexicographical order) than to choose or construct an arbitrary 
universal Turing machine or Turing-complete programming language. In the case 
of the enumeration, for example, once the size of the Turing machines is 
fixed, one can go through every machine in the enumeration, hence making the 
particular order of the enumeration irrelevant. 
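
As a small-scale illustration of going through every machine of a fixed 
size, the sketch below enumerates all 2-state, 2-symbol machines, runs each 
on a blank tape and tallies the outputs. It is a simplified sketch under 
stated assumptions, not the implementation used for D(5): the function names, 
the all-0 blank tape, the step cap, and the convention that the output is the 
contiguous run of visited cells are all assumptions made for this example. 

```python
from collections import Counter
from itertools import product
from math import log2

HALT = 0  # hypothetical label for the halting state; working states are 1..n


def transitions(n):
    """All entries available for one (state, symbol) pair: 4n moving + 2 halting."""
    moving = [(q, w, d) for q in range(1, n + 1) for w in (0, 1) for d in (-1, 1)]
    halting = [(HALT, w, 0) for w in (0, 1)]
    return moving + halting


def run(table, max_steps):
    """Run from an all-0 tape in state 1.

    Returns the visited cells as a string if the machine halts within
    max_steps, otherwise None.
    """
    tape, pos, state = {}, 0, 1
    lo = hi = 0  # leftmost and rightmost cells visited so far
    for _ in range(max_steps):
        next_state, write, move = table[(state, tape.get(pos, 0))]
        tape[pos] = write
        if next_state == HALT:
            return "".join(str(tape.get(i, 0)) for i in range(lo, hi + 1))
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
        state = next_state
    return None


def output_distribution(n, max_steps):
    """Return (machines enumerated, machines halted, output frequencies)."""
    keys = [(q, s) for q in range(1, n + 1) for s in (0, 1)]
    counts, total = Counter(), 0
    for entries in product(transitions(n), repeat=len(keys)):
        total += 1
        out = run(dict(zip(keys, entries)), max_steps)
        if out is not None:
            counts[out] += 1
    halted = sum(counts.values())
    return total, halted, {s: c / halted for s, c in counts.items()}


# The step cap of 10 is an assumption; it lies above the known
# 2-state Busy Beaver bound of 6 steps.
total, halted, D2 = output_distribution(2, max_steps=10)
K = {s: -log2(p) for s, p in D2.items()}  # Coding theorem estimate
for s in sorted(D2, key=D2.get, reverse=True)[:5]:
    print(s, round(D2[s], 4), round(K[s], 3))
```

Since the tally is a plain count over the whole set of machines, visiting 
them in any other order would give the same frequencies. 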

We've also shown that the procedure seems robust (small variation be- 
tween different sets of Turing machines of different size) and that the results 
are in accordance with our intuitions of complexity and randomness. D(5) 
is essentially as we expected; it provides a precise and objective notion of 
what a "simple string" is as compared to a "random (Kolmogorov complex) 
string", and it is not a trivial accomplishment to provide such a deep in- 
sight into the world of structure versus randomness while at the same time 
attaching a numerical and formal meaning to such concepts. 

The method described here is computationally expensive, but it doesn't 
need to be executed more than once (and we have already done so for up to 
5-state Turing machines), obviating the need to recalculate the complexity 
values of the strings that have already been calculated. The classification 
provides a prior distribution that should prove to be of general use and 
application. As a result we now have two complementary techniques for 
approximating K: the traditional lossless compression algorithm technique and 
the one described in this paper. What remains to be done is to put 
these together and show that they actually work in harmony when they over- 
lap over string lengths where both may provide reasonable approximations. 
Also needed are extensions of the current model that provide insight into and 
formal approximations of non-binary and n-dimensional objects other than 
unidimensional strings (e.g. images, for which we have also provided some 
important steps, see [23]). We are confident that the technique will prove to 
be of great value in combination with compression methods, making the 
combination stronger than either would be on its own. 
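
To see why a compression-based estimate needs a complement at very short 
lengths, the following sketch compresses every binary string of length 1 to 
8 and records the compressed sizes. It is an illustration under assumptions, 
not part of the method: zlib stands in for a generic lossless compressor, 
each bit is stored as one ASCII character, and the helper name is 
hypothetical. 

```python
import zlib
from itertools import product


def compressed_size(bits: str) -> int:
    """Size in bytes of the zlib-compressed string (one ASCII byte per bit)."""
    return len(zlib.compress(bits.encode("ascii"), 9))


# For each length n, the set of distinct compressed sizes over all 2^n strings.
sizes = {
    n: sorted({compressed_size("".join(b)) for b in product("01", repeat=n)})
    for n in range(1, 9)
}

for n, distinct in sizes.items():
    print(f"length {n}: {2 ** n} strings, compressed sizes in bytes: {distinct}")
```

In this setting the container overhead alone exceeds the length of the input, 
so the compressed size says little about which short strings are simpler 
than others. 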

In summary, this procedure seems to be a possible alternative to having 
to choose a universal Turing machine U in order to evaluate K_U, and is 
complementary to the compression approach, which is ineffective for short 
strings. The technique advanced in this paper provides a reasonable and 
consistent approximation to K(s) that is in agreement with the theory and 
represents evidence in confirmation of Levin's distribution and Solomonoff's 
universal induction. 

5 An Online Algorithmic Complexity Calculator implementing the technique 
presented herein and making the data available to the research community is 
accessible at http://www.complexitycalculator.com 